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ABSTRACT 


A recent advance in computer architecture, the parallel processor computer, has 
made it theoretically feasible to reduce the time required to integrate a system of n 
ordinary differential equations by a factor of n. One established numerical technique, 
the Runge-Kutta-Fehlberg method, is adapted for parallel processing on an Intel 
SClemuleme One itch s@m@oncurrent supercomputer. he algorithm is evaluated 
using a standardized collection of systems of equations. It 1s concluded that this type 
of parallel processor is not suited for the solution of this problem due to the 
conimunications overhead required. Short developments of ordinary differential 


equations and numerical integration methods are provided as background. 
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I. PRELIMINARIES 


A. INTRODUCTION 

With the advent of modern computers, the implementation of numerical methods 
for the solution of svstems of ordinary differential equations has become commonplace. 
Rather than developing new methods, the task at hand has become making current 
methods more efficient by reducing the amount of computer time needed to solve a 
system or by increasing the accuracy of the results. Previously, efforts to accomplish 
this task were centered on modifving existing algorithms. 

A recent advance in computer architecture, the advent of parallel processing, has 
made it theoretically feasible to reduce the time required to integrate a system of n 
differential equations bv a factor of n, assuming the parallel processor computer 
possesses n or more processors. This is a very significant time savings compared to 
those previously realized. This savings is provided by the parallel processor's ability to 
perform n different tasks, e.g. integration of n differential equations, simultaneously 
rather than sequentiallv as is done with a computer possessing a single processor. 

In this thesis, one established method for solving systems of differential 
equations, the Runge-Kutta-Fehlberg method, 1s adapted for parallel processing on an 
Intel Scientific Computer 1PSC Concurrent Supercomputer” The aleoritimas nem 
evaluated using a standardized collection of systems of equations. It is found, 
however, that this type of parallel processor is not suited for this purpose due to the 
communications overhead required. As background, short developments of ordinary 


differential equations and numerical methods are presented. 


B. ORDINARY DIFFERENTIAL EQUATIONS 

An equation that involves derivatives or differentials of a function or functions 1s 
called a differential equation. Differential equations are further classified as ordinary 
or partial differential equations. If the equation is a function of ordinary derivatives, it 
is an ordinary differential equation (ODE). If it is a function of partial derivatives, it is 
a partial differential equation (PDE). The order of a differential equation is the value 
of the highest derivative which appears in the differential equation. For the purposes 


of this thesis, the only concern is that of ordinarv differential equations. 


A first-order ordinary differential equation 1s the simplest case. Consider 
v a {(x.¥). (1.1) 
This equation is a first-order ordinary differential equation. The general solution of 1.1 
is a one-parameter familv of curves. To select one member from this family, it is 


necessary to specify an initial value. That is to say the initial value of the dependent 


variable is specified for anv value of the independent variable. An example of this 
would be 


Y =~ {(x,¥), ¥(X,) = Yo: (ce) 


A system of n first-order equations is of the form 


a 


“ee 
So 


i es Yys ye: Stes Y. 


a 
= f5(x, Yye Ys Te 29 Be) (Le) 


Yn = PnOG Yq ¥or-+ + ¥p) 
Where the om (7 = 1, 2,...,n) are functions of the independent variable x and He a= 


ee oe eresienctionssor the vy. --,¥,. fhe solution to |.3 will be a family of 
ordered n-tuples of the form 


= ae io ok (1.4) 


Again, to select one member of this familv of n-tuples, it is necessary to have an initial 


Value. In this case, the initial value problem becomes a vector equation 


y = 1 X,) = co: (1.5) 


An nth-order ordinary differential equation of the form 


eee), a ae) (1.6) 


is solved by converting it into a set of n simultaneous first-order differential equations 


of the form 


Uy, = Uy = Hi upe ) 
oe = Uz on f(x, uy, ER re i, Un) (1.7) 
Oe eee se eee nee) 
ese pe 
by [ctu een =) toes yl), jonueaye Ue yin), by differentiating each of these 
equations: and by substituting for y, yh), eee yal) in terms Of Whal>, | a Ue 


order to determine a unique solution to this set of simultaneous equations, initial 
conditions must again be specified. These initial conditions are of the form uj(c) = 
dj, us(c) = ds5,..., u,(c) = d,, which are obtained from transforming the conditions 
in terms of y and its derivatives. For further discussion of the solution of ordinary 
differential equations, see [Ref. 1}. 

Given a system of differential equations, the problem now becomes one of 
solving the system. Whenever possible, it 1s desirable to find an explicit solution. 
However, most systems cannot be solved exactly--that is, it 1s impossible to obtain a 
solution in elementary form. It is because of this characteristic of ordinary differential 


equations that we must turn to numerical methods to obtain the solutions. 
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If. DEVELOPMENT OF NUMERICAL METHODS 


A. TAYLOR SERIES METHOD 

Although the Tavlor series methods are not usually used in practical problems, 
these methods must, however, be understood in order to understand the Runge-Kutta 
methods described in the next sections. In order to solve the initial value problem 
posed in Equation 1.2, we develop the relation between v and x bv determining the 
o HH y(x) 


has m+] continuous derivatives on an interval I containing x,, then by Taylor's 


coefficients of the Taylor series in which we expand y about the point x = x 
Formula with remainder, 


es) = ew) KX) et (1/2) yx MRK)? een (2.1) 
(ie! yy cx J (Gms yyy Me)(x-x yO"! , 


for some c € (X,X,). (Thomas and Finney [Ref. 2: pp. 663-665] show a detailed proof 
of this theorem.) If v(x) is a solution to the initial value problem 1.2 and y(x) has 


m+ 1 continuous derivatives, then 


Ce ud es) 
v(x.) = M9GLyG,) = & + fy =f + ff 
y'(Xp) = xv (Xp) = fy Ly - (yx iu by" ae (2.2) 


= 2 a 
eee tf tbat fF, 


Where f and its partial derivatives are all evaluated at (x,, ¥(X,,)). 

One could continue in this manner, computing any derivative of y evaluated at x,. 
yx, in terms of f and its partial derivatives evaluated at x y(x,,). It is obvious 
to see that for other than reasonably small values of m, the derivatives are usually 
bothersome to compute. For this reason, m in Equation 2.1 1s chosen to be reasonably 


small. By letting h, = (X-X,), ¥,4 1 18 approximated by 


to 
Us 
— 


@: 
Vatt = Yn t fp Yalhy + 2K, vena? +e. e 


+ emg Uc, ee 


Equation 2.3 represents a single step numerical approximation to the solution of the 
initial value problem 1.2 and is Known as the Tavlor series method of order m.! From 


Equation 2.1, the error of this method is represented by 
En = (mt name, yeyyn,mF |, (2.4) 


where Gi (axe): 

As seen above, the computational disadvantages of the Tavlor series method are 
due to the calculation and evaluation of the derivatives (1), fb2) ee fm-1) at (oe 
Y,,). It will be seen in the following sections that a Runge-Kutta method of order m is 
usually as accurate as the Taylor series of order m and is simpler to use. The Tavlor 
series method, however, will be shown to be of theoretical value since the order of a 


Runge-Kutta method will be defined using the Taylor series method. 


B. RUNGE-KUTTA METHOD 

The German mathemetician Carl Runge (1856-1927) was the first to develop a 
numerical integration technique designed to approximate the Taylor series method 
Without requiring explicit evaluations of derivatives beyond the first while preserving 
the accuracy of the Tavior series method [Ref. 3]. The technique was later improved 
by the German mathematician Martin Kutta (1867-1944) [Ref. 4] and, thus, it was 
namied the Runge-Kutta method. 

This technique sets up a problem with undetermined paramicters and uses 
evaluations of f(x,y) within the interval (x,, y,) and (X, 4 1, Yn4 1) thus bypassing the 
derivatives of the Taylor series by requiring f(x, v) to be evaluated a number of 


additional times within the interval. The general form of this scheme 1s 


Vv 
vn ] = Yn a > wiki (2.3) 
= 


where v is the number of f(x, y) substitutions, W; are the weighting coefficients, and the 


K; satisfy the sequence 


IPublished by the English mathematician Brook Taylor (1685-1731) in 1715, 
however, Gregory and Leibnitz knew the series before Taylor, and John Bernoulli had 
published a similar result in 1694 (Ref. 1: p. 106]. 


lee 


bw 


ky = bil ps Yn) 
K5 = hf(x, ie c5h, ¥n ae a5 ,ky) (256) 


x 
J 
I 


ee aeecol led 2302 


The values of k can be thought of as estimates of the change in y when x changes a 


value of h. The problem now becomes one of determining the coefficients w;, c;, and 


a 
aij. Each set of parameters will specify the points (x, y) where f{x. y) is to be 
evaluated. Therefore, this method calculates y, 4, using only a value for y, and 
evaluations of f(x, y) at points between x, and x, 4. For this reason the method is 
termed self-starting. To obtain specific values for the coefficients, a value for v is 
chosen and y, 4 1s expanded in powers of h such that it agrees as well as possible 
with the solution of the ordinary differential equation found using the Tavlor series 
Nel nOG wes a COmpiete development see | Ref. 5].) 

A popular example of this method is the classical fourth-order Runge-Kutta 


miethod. [t 1s given bv 


A eee se 2Ky se 2K5 + ky), (7) 
Rete eae ax Yo) 
Sue, 2, V.vaeka/ 2) 
Om s eo Va Kye) 
= hf(x, 7 hy, V, + he) 


In this case, we avoid the derivatives of the Taylor series by perfornung four function 
evaluations on {(x, y) in the interval (x,, y,) and (X, 4.1, ¥,+4]). As was stated earlier. 
the Taylor series method provides an error estimate for other methods. Here, as h goes 
to 0. this method agrees asymptotically with the Taylor series through the h? term, 
thus, making it a fourth-order method with error term proportional to Es from 
Equation 2.4. A disadvantage of this method is that an estimate of the local error 1s 


not readily available to help in choosing a suitable stepsize h. 


is 


C. RUNGE-KUTTA-FEHLBERG METHOD 

In 1969, Erwin Fehlberg published a variation of the Runge-Kutta method which 
uses an estimate of local error to select a proper stepsize [Ref. 6]. For a given value of 
Vy, Fehlberg’s method computes two estimates of y, 41 using fourth- and fifth-order 
Runge-Kutta formulas. In order to obtain an estimate of local error, the two values of 
Vn+ are compared. The stepsize is then adjusted, depending on the local error. 


Fehlberg first uses 


Ynt1 = ¥n + 2 ck; ae 


where the k; satisfy 


i-] 
ke Dea ea > Bik = 1, ae (2.9) 
i 


This method requires six function evaluations per step. As with the fourth-order 
method discussed in Section II.B, the c; are found by expanding vy, 4, in powers of h, 
so that it agrees as well as possible with the Taylor series solution. The coefficients 
determined by Fehlberg are found in Appendix A. Fehliberg found that the two 
expansions match until the h,° term, thus, making the method fifth-order. This is a 
departure from the behavior of the nth-order Runge-Kutta methods where n = 1, 2, 3, 
4 which produce (n+1)st order error. This partially explains the popularitv of the 
classical fourth-order method described in the previous section; it takes two more 
function evaluations to obtain one more order of accuracy. Fehlberg’s method, 
however, exploits the sixth function evaluation by determining a second value y, 4 )* 


using 


6 
var = Sn eee (2.10) 
— 


This value was found to be fourth-order using the method described above. The local 


error 1s then estimated by 


6 
y= » - C:*)K: (2.11) 
= 


Tanicheismisedmenr Stepwise commol. (Rel. 7pp. 129-131) 
Because the Runge-Kutta-Fehlberg method is generally thought of as one of the 
“best methods” available for solving nonstiff systems of equations [Ref. 8], it was 


chosen to adapt for parallel processing. 


II. DESCRIPTION OF SEQUENTIAL PROGRAM 


A FORTRAN program written by H.A. Watts and L.F. Shampine [Ref. 7: pp. 
132-147] was chosen to be implemented. The program solves initial value problems in 
ordinary differential equations and is based on the Runge-Kutta-Fehlberg method 
described in Section II.C. It 1s designed to solve non-stiff and mildly stiff systems of 
differential equations when derivative evaluations are inexpensive. The program is 
typically used to integrate from a given initial value to a desired final value but can 
also be used as a one-step integrator. The program consists of a main program along 
with subroutines RKF45, RKFS, and FEHL. The following is a brief description of 


the program as it was developed for sequential processing. 


A. MAIN PROGRAM 

The main program first defines the system of equations to be solved through the 
use of the problem specific subroutine F. Additionally, it defines the system’s initial 
conditions and program parameters such as absolute and relative error tolerances, and 
It provides output of data and error messages. Once the system of equations and 
parameters are defined, the main program begins solution of the problem by passing 


information to subroutine RKF4S., 


B. SUBROUTINE RKF45 

Once the main program sets up the problem, subroutine RKF45 becomes the 
interfacing routine for the solution of the problem. RKF4S5 first sets up work arrays 
for storage of information used during integration, thus relieving the user of lengthy 
subroutine calling lists later in the program. It then calls subroutine RKFS, providing 


it with the work arrays. 


C. SUBROUTINES RKFS AND FEHL 

RKFS is the subroutine which, along with subroutine FEHL, performs the 
integration of the system of equations. It first establishes a minimum acceptable 
relative error and a maximum number of function evaluations allowed in order to avoid 
the expense of a user’s attempt to obtain an excessive accuracy. It next checks input 
parameters, issuing error flags back to RKF45 as appropriate. Machine epsilon is then 


computed and used in conjunction with the minimum acceptable relative error to limit 
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precision difficulties. Error flags are issued if user specified relative error tolerance is 
too small. Once these preliminary tasks are complete, initialization is performed. This 
includes setting the function evaluation counter to zero and estimating the initial 
integration stepsize H. Throughout the program, the stepsize is not allowed to become 
smaller than 26 units of roundoff in the dependent variable T. 

Once stepsize is computed and checked, subroutine FEHL is called. Subroutine 
FEHL contains the heart of the integrator in the form of the FORTRAN equivalent of 
the Runge-Kutta-Fehlberg formulas represented in Equations 2.8-2.10. Because of its 
importance, subroutine FEHL is included as Appendix B. FEHL performs the 
integration and returns to RKFS which in turn implements Equation 2.11 in order to 
determine local error and test to see if the integration step was successful. If 
unsuccessful, the stepsize is reduced and integration is attempted again. If successful, 
the solution at [+H is stored and the components of the system of equations are 
reevaluated at T+H using subroutine F. The function evaluation counter is changed 
to reflect the function evaluations performed in FEHL. This integration process 
continues until the final location is reached causing program flow to return to the main 


program which provides output of the final solution. 
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TV. DMVPLEMENTATION 


A. METHODOLOGY 

In order to test the hvpothesis that the time required to integrate a system of n 
ordinary differential equations can be reduced by a factor of n through parallel 
processing, the program described in Chapter III was implemented on an Intel 
Scientific Computer iPSC Concurrent Supercomputer. Appendix C contains a 
technical description of this computer. The particular computer used in this thesis 
possesses 16 processors thus making it a 16-node or 4-dimensional hypercube, as 
explained in the appendix. 

The general scheme of the testing of this hvpothesis was to choose systems to be 
integrated from a standard suite of problems used to test the performance of other 
integrator programs [Ref. 9: pp. 617-621]. Four systems were chosen in order to test 
performance of both small and moderate sized systems. As was done in the reference, 
the interval of integration for all implementations was [0, 20]. The constraint of only 
examining small and moderate sized systems was imparted due to the number of 
available independent processors being 16 or less. The systems chosen consist of 2 
equation, 3 equation, 4 equation, and 10 equation svstems. Each problem was first 
solved sequentially and timed on a single processor of the hvpercube by adapting the 
code previously described, thus providing a sequential time standard to be compared 
with parallel run times. Next, the Runge-Kutta-Fehlberg algorithm was adapted for 
parallel processing using varying schemes to optimize performance of the hypercube. 
the systems were then solved using these schemes and timed. Detailed descriptions of 
each system’s implementation are contained in this chapter following a discussion of 


internodal communication times. 


B. INTERNODAL COMMUNICATION TIMES 

The theoretical feasibility of reducing integration time by a factor of n assumes 
that the time required to pass information between processors is minimal when 
compared to the speed of computation. Time spent in communicating 1s a critical 
factor in the implementation of an integrator of systems of ordinary differential 
equations since, after each integration step, the solutions to each component of the 


system must be combined at a central location. This requires, for every step in the 
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integration, a miessage pass back to a central node from each node tasked with 
processing a separate component of the svstem of equations. For this reason, the 
hypercube’s internodal communication times were empirically determined for later use 
in minimizing total time spent in communication when implementing the parallel 
algorithms. 

Based on the topology of the 4-dimensional hypercube, as is discussed in 
Appendix C, it was decided to determine communication time for a message sent round 
trip from the host to the cube as well as between two nodes of distance 1, 2, 3, and 4 
from each other. For these timings, a message of length 4 bytes was sent round trip 
1000 times and an average round trip time was calculated. This experiment was 
performed a total of ten times for each and a final average round trip time was found. 
From the host to the cube, average round trip time was .02308 seconds. Average 
round trip times between nodes whose internodal distances are 1, 2, 3, and 4 were 
002709 seconds, .005317 seconds, .006615 seconds, and .007797 seconds respectively. 
In addition, message passing was also timed for messages of length 16, 32, 40, and 80 
bytes. These times were similar to those found using the 4 byte long message, thus 
showing that internodal communication times are not a function of message length. 

Three conclusions may be drawn from these results. First. when minimization of 
communications time 1s desired, the host should be used only to house the main 
program and provide input and output. It should not be used as a “seventeenth” node 
due to the high relative order of host to cube communication time as compared with 
internodal communication times. Secondly, to minimize the total communication time 
in a given parallel algorithm, internodal distances must be considered when assigning 
tasks to specific nodes. Thus, in order to attain minimum program run times, parallel 
algorithms must create an optimum hypercube topology for a given system of 
equations based on internodal distances. Thirdly, since communication time is not 
message length dependent, it can also be concluded that a single long message 1s 


preferred to several short messages containing the same information. 


C. A TWO EQUATION SYSTEM 
1. System Description 
Equation 4.1 depicts the two equation system chosen. It represents the 


growth of two conflicting populations. 


Mi ~ ey ¥i¥5), SO) A (4.1) 
[> me oom Ys). Ne eaa ee 


2. Sequential Implementation 

Sequential implementation was accomplished by adapting the integrator 
program described in Chapter III to run at node 0. This adaptation includes a 
subroutine F tailored to Equation 4.1. A main program, running at the host, loaded 
the integrator program to node 0 and provided output of integration results and 
sequential run times. It should be noted that these run times do not include time 
necessary to load the integrator program to node 0. 

3. Parallel Implementation I 

The parallelization scheme thought to be most natural, 1.e. sending component 
1 to node 1, component 2 to node 2, ... , component mn too node ies aie 
implemented. This scheme was termed “shotgun” and consists of a main program at 
the host, the modified integrator program less subroutine FEHL at node 0, and node 
programs at nodes | and 2. Excerpts from the node 0 program and the node | and 
node 2 programs are listed in Appendix D. 

Again, the main program loads the node programs and outputs integration 
results and parallel run times. The node 0 program is the driver program for the 
integration. It has been modified by removing subroutine FEHL and adding 
communications with nodes | and 2 which evaluate components | and 2 of the system 
respectively. It should be noted that these node assignments were made in order to 
insure internodal distances were minimized. Referring to Figure 4.1, the reason that 
this process was termed “shotgun” was due to the fact that the node 0 program sends 
and receives information from the nodes processing the component computations in a 
“shotgun” fashion. 

A typical integration step takes place as in the sequential program except that 
instead of calling subroutine FEHL, the node 0 program first sends the component 
nodes the initial y vector. Although this message passing is sequential, the 
computation at the nodes does overlap, providing concurrent component processing. 
The component nodes perform the first function evaluation, using subroutine FNODE, 
and the first Runge-Kutta-Fehlberg step. This information 1s then sent back to node 0. 
This process takes place five times per integration step. Node 0 then computes a 
solution at the new location T+H, computes an appropriate stepsize, and again calls 


the component nodes to continue integration. 


In general, the “shotgun” scheme may be extended to a 16-node hypercube to 
integrate systems up to size fifteen. For this scheme, the cost in number of message 


transmissions 1s represented by Equation 4.2. 
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Figure 4.1 Shotgun Scheme. 


COST =2e NEON X NEE (4.2) 


Where NEQN is the size of the system and NFE ts the number of function evaluations. 
Since the number of function evaluations increases as the error tolerances are 
decreased, integration times can be expected to increase due both to the increased 
number of computations required, as well as the increased communications overhead 
requirement. For this system of two equations, the communications overhead in terms 
of the number of message passes performed is four times the number of function 
evaluations. 
4. Parallel Implementation I] 

In order to reduce the communications overhead present in the “shotgun” 
implementation, a second integration scheme was developed. This new scheme, termed 
“flip-flop”, involves sending information from node 0 to nodes 1 and 3 and then 
performing the integration step through a series of computations at these two nodes 
and message passes between them. Figure 4.2 depicts this scheme and program 
excerpts are contained in Appendix E. 

Again the program at the host provides loading of the three node programs 


and output of the results. The node 0 program is the driver for the integration. The 
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Figure 4.2 Flip-flop Scheme. 


integrator segment of the original sequential code was again modified. It now only 
sends and receives one message from nodes | and 3 for a total of four message passes 
per integration step. Node | computes function evaluations for the first component, 
node 3 for the second. First, node 0 sends the initial information to each component 
node and these nodes compute the first function evaluation for the respective 
component. Once completed, the two component nodes exchange information in a 
“flip-flop” manner and then proceed with their respective second function evaluation. 
This process continues until all of the function evaluations in the Runge-Kutta- 
Fchilberg scheme are completed at the component nodes, at which time nodes | and 3 
transmit their final component solutions back to node 0. The node O integrator 
program proceeds by advancing the integration as was done in previous programs. 

[In terms of communication overhead, the “flip-flop” scheme 1s superior to the 
“shotgun” scheme. The cost, in terms of message transmissions, is represented by 
Equation 4.3, 


GON 2S % Nive (4.3) 
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Where NFE is the number of function evaluations. The cost 1s derived from the fact 
that fourteen message transmissions occur during the computation of five function 
evaluations. These include ten between nodes 1 and 3, two between node 0 and node 
1, and two between node 0 and node 3, as shown in Figure 4.2 . This cost 1s thirty 


percent of the cost of the “shotgun” algorithm for the two equation system. 


D. A THREE EQUATION SYSTEM 
1. System Description 
The three equation system chosen 1s depicted in Equation 4.4 and represents a 


linear chenvical reaction. 


i a ee oes 2 
see oo See) — 9, (4.4) 
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2. Sequential Implementation 

Sequential implementation was performed in the same manner described for 

the two equation system and by modifying the subroutine F to reflect Equation 4.4. 
5. Parallel Implementation I 

The three equation system was first run parallel using the “shotgun” 
integration scheme with node O for the integrator program and nodes 1, 2, and 4 for 
the component programs. Nodes 1, 2, and 4 were chosen to again nuninuze internodal 
distances. Program excerpts are not included for this implementation as they are minor 
modifications of those found in Appendix D. From Equation 4.2, it can be deternuned 
that the the cost of solving the three equation system by the “shotgun” method is equal 
to six times the number of function evaluations. 

4. Parallel Implementation I] 

An additional scheme was developed to decrease the total integration time for 
the three equation system. It was termed the “train” scheme due to its use of messages 
in the form of a real valued vector of size seventeen which is passed from node to node 
during integration. The vector contains information necessary for integration including 
values for T, stepsize H, y values, v’ values and computed derivatives. Upon the 

lessage’s arrival at a node, the node performs function evaluations for its respective 
component, updates information in the message, and passes it on. An analogy can be 
made between the process of updating information in the message and filling cars in a 


train; thus the name “train.” 
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Figure 4.3 Three Equation Train Scheme. 


Figure 4.3 depicts this integration scheme and program excerpts are Hsted in 
Appendix F. Node 0 is used to run the driver program while the three components of 
the system are processed at nodes 1, 2, and 3. Nodes I and 2 were chosen since they 
are of mintmal distance to node 0, whereas, node 3 was chosen for its adjacency to 
node I. 

Referring to Figure 4.3 , the “train” scheme will be explatned in terms of 
message pass time frames, meaning the time frame during which a message ts passed 
between two nodes. In the first time frame, node 0 computes the first Runge-Kutta- 
Fehlberg function evaluation and sends the information “train” to node 1 which 
performs the second function evaluation. While this computation 1s being performed, 
the second time frame begins when node 0 sends the integration information to node 2. 
Upon completion of its computation, node | updates the information pertaining to its 
component and sends the “train to node 3) (At tins cit snodes 2 ania ane 
performing their computations for the first function evaluation. As a worst case, it 1s 
assumed that node 3 sends its updated information to node 0 during the third time 
frame and that node 2 returns its information during a fourth time frame. Throughout 
this process, as information 1s received at node 0, the updated values are placed in a 


buffer unul all component nodes return their updates. This completes one function 
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evaluation which takes a maximum of four message pass time frames during which five 
message passes occurred. This cycle is repeated until six function evaluations have 
deen performed. Upon completion of the function evaluations, node 0 selects a new 
stepsize H and continues integration at the new location I +H. 

In this topology, for one function evaluation. five message passes have been 
accomplished in the time associated with four. This gives a cost, in terms of message 


trar.snussions, as 


COST = 4 X NFE (4.5) 


where NFE is the number of function evaluations. This value is two thirds of the cost 


associated with the “shotgun” scheme for the three equation system. 


E. A FOUR EQUATION SYSTEM 
1. System Description 
The four equation system implemented is a two body orbit problem and is 


represented in Equation 4.6. 


a se a et = S, 
fo" Ye HAO) = 0, (4.6) 
¥3, ey yo°)? < (0) - 0, 
7,3. . 
a ee oO) = (1 + Bt = 2), 


€— 7 lerereds the eccentricity of the orbit. 


2. Sequential Implementation 
Sequential implementation was performed in the same manner as described for 
the two equation system and by modifying the subroutine F to reflect Equation 4.6. 
5. Parallel Implementation | 
The four equation system was first run parallel using the “shotgun” integration 
scheme with node 0 for the integrator program and nodes |, 2, 4, and 8 for the 
component programs. Again, these nodes were chosen to minimize internodal 
distances. Program excerpts are not included for this implementation as they are minor 
modifications of those found in Appendix D. From Equation 4.2, the cost of solving 
the four equation system with the “shotgun” method is equal to eight times the number 


of function evaluations. 
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4. Parallel Implementation I] 


The four equation system was also run using the “train” scheme. As shown in 





Figure 4.4 Four Equation Train Scheme. 


Figure 4.4, this application employs nodes 0 through 4. Nodes 1, 2, and 4 were chosen 
for their adjacency to node 0 and node 3 was chosen for its adjacency to nodes | and 2. 
As in the three equation application, node O runs the driver program. Nodes | through 
4 process their respective components as was done by the component nodes for the 
three equation systein. 

The integration process can again be described using message pass time 
frames. In the first time frame, node 0 makes the first Runge-Kutta-Fehlberg function 
evaluation and sends a vector message containing twenty-two pieces of information 
necessary for integration to node I. Node 1 computes the second function evaluation 
for its component. In the second time frame, the “train” message is again sent out by 
node J, this tme to node 4. Node 4 in turn computes its function evaluation and sends 
updated information to node 0 where it is stored in a buffer until the other “train” 
message arrives. During the second time frame, node | also sends its updated message 
to node 3. In the third time frame, node 3 computes and sends its results to node 2. 


finally, in the fourth time frame, node 2 computes new information and sends the 


message containing updated information from nodes 1, 2, and 3 to node 0. In all, four 
message pass time frames have elapsed upon node 0 receiving all component 
information necessary for continuing integration. This cycle repeats until all six 
function evaluations are accomplished, causing node 0 to recompute stepsize H and 
continuing integration at IT +H. 


‘As with the three equation “train” implementation, the communications cost is 


COST = 4 X NFE (4.7) 


where NFE is the number of function evaluations. This is half the cost of the four 


equation “shotgun” scheme. 


F. A TEN EQUATION SYSTEM 
1. System Description 
The radioactive decay chain problem listed in Equation 4.8 was chosen as the 


ten equation system. 
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2. Sequential Imp!ementation 
Sequential implementation was performed in the same manner as described for 
the two equation system and by modifying the subroutine F to reflect Equation 4.8 . 
5. Parallel Implementation | 
The ten equation system was first run parallel using the “shotgun” integration 
scheme with node 0 for the integrator program and nodes | through 1!0 for the 


component programs. In this case, internodal distances were not considered in the 
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node assignments. Program excerpts are not included for this implementation as they 
are minor modifications of those found in Appendix D. From Equation 4.2, the cost 
of solving the ten equation system with the “shotgun” method 1s equal to twenty times 
the number of function evaluations. 
4. Parallel Implementation II 
The ten equation system was also implemented using the message “train” 
technique. Figure 4.5 depicts the topology of the implementation. Nodes 0 through 10 
were employed using node O as the driver and nodes 1 through 10 for component 
programs. Minimization of internodal distances was taken into account. As can be 
seen in Figure 4.5 , the four equation algorithm was modified to have three message 
“trains.” Information from nodes 1, 2, and 3; nodes 4, 5, 6, and 7; and nodes §, 9, and 
10 is contained in each of the three messages. The computation flow is the same as the 
four equation algorithm except that node 1 transmits the message “train” to node 5 
prior to performing its own function evaluation, then sends the updated message to 
node 3. Six time frames elapse during the course of a single function evaluation. 
During this period, thirteen message passes occur. The cost in communications 


overhead can be expressed as 


COST = 6 X NFE (4.9) 


where NFE is the number of function evaluations. When compared to the “shotgun” 


implementation, a seventy percent time savings 1s theoretically attained. 
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Figure 4.5 Ten Equation Train Scheme. 


V. RESULTS AND CONCLUSIONS 


A. RESULGS 

Results of the sequential and parallel implementations described in Chapter IV 
are displayed graphically in Figures 5.1 through 5.4. For all implementations, the 
interval of integration was [0, 20], the absolute error tolerance set equal to 0.0, and the 
relative error tolerance was varied for each system as indicated below. In all four 
cases, the sequential implementation was fastest while the “shotgun” scheme was 
Slowest. The “flip-flop” and “train” schemes resulted in reduced run times from those 
of the “shotgun” scheme. It is clear that these reductions, resulted from the lowering of 
the communications overhead required in the “shotgun” scheme. 

1. Two Equation System 

For all implementations, the two equation system was solved using relative 

error tolerances of 10 for n = 3, 4, ... , 8. The number of function evaluations 
performed was the same for all implementations and ranged from 403 to 2622 


3 and 10°8 respectively. The “shotgun” 


corresponding to relative error tolerances of 10° 
scheme resulted in run times approximately eleven times longer than the sequentia! run 
times. The run times for the “flip-flop” implementation were approximately three and 
one half times those of the sequential runs and resulted in a seventy percent time 
savings over the “shotgun” times. This savings is in keeping with the comparison of 
equations 4.2 and 4.3 and it affirms that the communications overhead is the primary 
cause Of parallel run times being slower than sequential run times for this algorithm. 
2. Three Equation System 

Relative error tolerances for the three equation system were successively set at 
107" for ne= see eeelltne corresponding number of function evaluations ranged 
from 481 to 2418. Here, the “shotgun” run times were approximately twenty times the 
sequential run times, while the run times for the “train” scheme were approximately 
fourteen times the sequential run times. Again, the relationship between the 
communications overhead of the two parallel methods, as expressed in Equations 4.2 
and 4.5, is upheld. The “train” scheme, in fact, attained run times that were about two 


thirds the run times for the “shotgun” scheme. 
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5. Four Equation System 
The number of function evaluations performed in solving the four equation 
system ranged from 6/8 to 27/1. These values corresponded to relative error 
tolerances ranging from 107 to 10°7. “Shotgun” run times were approximately thirteen 
times sequential run times. Run times attained by the “train” scheme were seven to 
nine times those aitained by the sequential runs and resulted in a seventy percent time 
Savings over the “shotgun” scheme. This ts a better result than encountered in the 
three equation “train” implementation because of the fact that the topologies of each 
scheme required the same number of message pass time frames for one function 
evaluation while the number of communications required in the “shotgun” scheme 
increased by two for the four equation svstem. Equations 4.2 and 4.7 predicted a fifty 
percent time savings going from the “shotgun” to the “train” implementation. The fact 
that the actual resulting savings was twenty percent better can be explained by looking 
at the derivation of Equation 4.7. It was derived using an upper bound of four 
message pass time frames when in fact the required number is likely to be somewhat 
less than the upper bound due to the semisequential nature of the “train” schemie’s 
topology. 
4. Ten Equation System 
Pinas 10r the larcest=or the jour systems implemented, relative error 
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tolerances Were varied from 10°~ through 10°!9. The minimum number of function 


evaluations was 252 corresponding to 10°? while the maximum was 1639 for the error 


Vee “Shotgun” timing results were approximately seventeen times the 


tolerance | 
corresponding sequential results. whereas the “train” scheme results were a more 
respectable eight times the elapsed sequential run times. The savings over the 
“shotgun” results attained by the “train” scheme were from fifty-two to sixty percent of 
the “shotgun” times. These savings values are less than the theoretical savings 
predicted using Equations 4.2 and 4.9. Most likely, this disparity 1s due to the 
complexity of the topology of the ten equation “train” scheme as compared to the 
others. It is difficult to predict the time loss due to message collisions which occur at 
node 0 as the message “trains” return with their updated information. Therefore, the 
predicted number of elapsed time frames in Equation 4.9 is a “best guess” estimate and 


appears to be optimistic. 
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B. CONCLUSIONS 

The theoretical reduction of times required to solve systems of ordinary 
differential equations by a factor of n was not attained through parallel processing. In 
fact, times required to solve systems of equations using parallel algorithms on the 
hypercube were greater than those of sequential algorithms. It was found that this is 
due to the communications overhead inherent in internodal message passing. When 
this high overhead is coupled with the requirement of numerical integration techniques 
to combine updated integration information from each system component after each 
function evaluation. parallel processing becomes ill suited. Two possible solutions to 
this problem might be (1) increased communications speeds within the hypercube or (ii) 
a small amount of available common memory for all the cube’s nodes. This common 
memory would alleviate the need to artificially create it as was done by passing the 
“train” message from node to node and then transferring updated data in the “train” to 
a buffer at the driver node. The lack of any shared memory negated the concurrent 
processing of a system’s components at the nodes by requiring message passes back to 
the driver node. Therefore, it is concluded that, due to the communications overhead 
encountered in conjuction with the manner in which systems of ordinarv differential 
equations are numerically solved. parallel processors with totally distributed memory 
are not suited for the solution of systems of ordinary differential equations. 

Additionally, several general conclusions may be drawn about parallel processing 
on the hypercube. First, the “natural” conversion of an existing sequential algorithm 
to a parallel algorithm may not be the best choice. This point was clearly supported in 
the failings of the “shotgun” implementations compared to the “train” implementations. 
Secondly, it is imperative to consider internodal adjacency and distances when 
developing parallel algorithms. This was evidenced by the empirical determination of 
internodal message pass times. Finally, parallel processing in itself is not a panacea. It 
is Well suited and affords large time savings to many applications; however, it has been 
shown here that for at least one application. parallel processing causes a significant 


tune loss over sequential processing. 
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Results for the Two Equation System. 
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Figure 5.2 Results for the Three Equation Svstem. 
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Figure 5.3 Results for the Four Equation System. 
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Figure 5.4 Results for the Ten Equation System. 
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APPENDIX A 
RUNGE-KUTTA-FEHLBERG COEFFICIENTS 


TABLE 1 
Ree ee aoe Ebpe RG COEEPICIENTS 




















| 25/216 | 
1/4 1/4 0 0 | 
3/8 3/32 9/32 6656/12825 1408/2565 | 
| L277 is 1932/2197 ~-7200/2197 (2907/2197 28561/56420 2197/4104 | 
1 | 439/216 =¢ 3680/5113 -845/4104¢ = 97 50 —175 
172 -8/27 2 -3544/2565 1859/1404 -11/40 2755 0 
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APPENDIX B 
SUBROUTINE FEHL 


SUBROUTINE FEHL 
FEHLB&RG FOURTH-FIFIO ORDER RUNGE-KULTA Mei neD 


FEHL INTEGRATES A SYSTEM OF NEON FIRST ORDER 
ORDINARY DIFFERENTIAL EQUATIONS OF THE FORM 

DY(I)/DT=F(T,Y(1),...,Y¥(NEON) 
WHERE THE INITIAL VALUES Y(I) AND THE INITIAL DERIVATIVES 
YP(I) ARE SPECIFIED AT THE STARTING POINT T. FEHL ADVANCES 
THE SOLUTION OVER THE FIXED STEP H AND RETURNS 
THE FIFTH ORDER (SIXTH ORDER ACCURATE LOCALLY) SOLUTION 
APPROXIMATION AT T+H IN ARRAY S(I). 

/F5 ARE THE ARRAYS OF DIMENSION NEQN WHICH ARE NEEDED 

FOR INTERNAL STORAGE. 
THE FORMULAS HAVE BEEN GROUPED TO CONTROL LOSS OF SIGNIFICANCE. 
FEHL SHOULD BE CALLED WITH AN H NOT SMALLER THAN 13 UNITS OF 
ROUNDOFF IN T SO THAT THE VARIOUS INDEPENDENT ARGUMENTS CAN BE 
DISTINGUISEED. 


INTEGER NEON K 
REAL Y(NE N) /T,H,CH, geese ae F2(NEON), 
1 F3(NEON),F4(NE EON), F5(NEON) , S(NEON 


CH=H/4.0 
DO 221 K=1,NEQN 

2 F5(K)=¥(K) +CH*YP(K) 
CALL F(T+CH,F5,F1) 


CH=3 .0*H/32.0 
DO 222 K=1,NEOQN 

222 FS(K)=Y(K +Che (YP (K) +3. O*F1(K 
CALL F(T+3.0*H/8.0,F5,F2) 


CH=H/2197.0 
DO 223 K=1,NEOQN 

223 F5(K)=Y(K)+CH*(1932. OAYP (K}+ (7296. ,O*F2(K)-7200.0*F1(K))) 
CALL F(T+12.0%H/13.0,F5,F3 


CH=H/4104.0 
DO 224 K=1,NEQN 
224 F5(K) AY (K) + CHS ( (8341. O*YP(K)-845.0*F3(K))+ 
1 944 0 eee -32832.0*F1(K))) 
CALE Mer F5,F4 


CH=H/ 20520. 0 
DO 225 K=1 NEQN 
225 F1(K)= =Y (K)+CH ((-6080 .O*YP(K)+((9295.0*F3(K)- 
OnFa(K)) 14 (41040. .OXF1(K)-28352.0*F2(K))) 
CALL ETH) 2.6 Fl FS 
COMPUTE APPROXIMATE SOLUTION AT T+H 
CH= H/7618050 0 


NEQN 
Zou S(Kz =e (K) +CHE (902880 . OA YP(K)+( 3855 735m 0 eo he 


1 1371249. sQRFa(K) })+ (398 64. 0*%F2(K)+ 
2 277020.0*F5(K) 
RETURN 


END 


APPENDIX C 
IPSC CONCURRENT SUPERCOMPUTER TECHNICAL DESCRIPTION 


Implementation of the Runge-Kutta-Fehlberg method in this thesis was 
performed using an Intel Scientific Computer 1PSC Concurrent Supercomputer. The 
basic system consists of two elements, a cube manager and a cube, as depicted in 
micUre Cale. 





Figure C.} 32-Node 1PSC Concurrent Supercomputer. 


The cube manager 1s a desktop programming station that provides programming 


support and system management. It consists.of an Intel System 310AP Multibus- 
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based computer using an [ntel 80286 central processing unit and an Intel 80287 
numeric processing unit. It also contains a 5 1/4” 140 megabyte Winchester disk, a 
320K byte floppy disk, a 45 megabyte cartridge tape, and a 2 megabyte ECC RAM 
memory. Additionally, it is equipped with an integrated Ethernet interface for 
communicating with the cube, and an alphanumeric terminal for input/output. Cube 
manager software consists of a UNIX-based programming and development 
environment with FORTRAN, C, Assembler, cube control utilities and 
communications, and system diagnostics. 

The cube is a complete ensemble of microcomputers connected in a parallel 
architecture. Each microcomputer, along with its own numeric processing unit and 
local memory is referred to as a “node.” Nodes are connected together by high-speed 
communication channels to form a self-contained “cube” in a free-standing enclosure. 
Each node in the cube is an independent, single-board computer. The node contains 
an Intel 80286 central processing unit and its companion 80287 numeric processing 
unit. The node also contains 512K bvtes of NMOS dynamic RAM and 8 bidirectional 
communication channels managed by dedicated 82586 communications coprocessors. 
Cube software consists of a monitor and kernal residing on each node. The monitor is 
contained in PROM and the kernal is loaded into node RAM after successful 
initialization. [Ref. 10] 

The interconnection scheme, or topology, for the iPSC is a “binary n-cube” or 
“hypercube.” The dimension n refers to the power of two corresponding to the number 
of nodes in the cube. In the casegof the 1PSC computer used im this thesis) the mumiben 
of nodes is 16; thus, making it a 4-dimensional hypercube, as depicted in Figure C.2. 

Within a 4-dimensional hypercube, each node has 4 nodes adjacent to it. The 
distance between a node and one of its adjacent nodes 1s defined to be 1. Additionaily, 
there are 6 nodes with distance 2, 4 nodes with distance 3, and 1 node with distance 4 
from any given node in the 4-dimensional hypercube. These internodal distances must 
be considered when emploving parallel algorithms, in order to minimize distances over 


which messages are passed. 
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APPENDIX D 


PROGRAM LISTING EXCERPTS FOR THE TWO EQUATION 


SHOTGUN SCHEME 


The following listing 1s an excerpt from the node 0 program for the Shotgun 


implementation of the two equation system. 


220 


eeu 


Zaz 


Zea 


224 


220 


226 


227 


220 


a2o 


Zo0 


Zor 


£5(k)= yk 


but, love) 


do 222 WA 
call sendw(ci,20,£5,len,k,1) 


do 223 k=1,2 
call recvw(ci,15,z,4,cnt,frnode, frpid) 
£5(frnode)=z 


do 224 k=1, 
call Senaueae 80 Seem ale: 


do 225 k=1, 
Cale ae 15,2,4,cnt,frnede, road 
£5(frnode )=z 


do 226 k=1,2 
call sendw(ci,40,£5,len,k, 1) 


do 227 k=1, 
call eek 15,2,4,ent, trnode, troia) 
£5(frnode)=z 


do 228 k=1, 
Call eet ee 50 eS hemes) 


do 229 k=1, 
call ed 15,2,4,7GnE/Gneden «rp Ld) 
f1(frnode)=z 


do 230 k=1, 
Cand See 60, f£1,len,k 41) 


do 231 k=1,2 
call recvw(ci,15,zz2,8,cnt, frnode, frpid) 
a NEL Z201 

eee (frnode )=z2z(2 


The following listing is the node 1 and node 2 program for the Shotgun 
implementation of the two equation system. 


program nodeZeq 
Sais «Seamer nodes 1 and 2...... Zecetiation SyStemoe.. 


integer chan,copen,nodeid,cnt,frnode 
dimension buf(4 anes irs 2 
Beence (OUrt tie our(2),h),(but(3),a),buf(4),b) 


chan=copen(mypid() ) 
AOdelasa cde) 


LO Vea bi recvwiician,. ©, Dur, 16,cnt, frnode, frpid) 
tp=tth/4.0 
Call recvw(chan,20,d5,8,cnt,frnode, frpid) 
call “Bene thes Oke nodeid) 
Z=at+3.0*h*(b+3.0*wl1)/32.0 


Galil seaca age ae 
Ep ets..07n/ o.0 
Gulenec vw, (ecnanns0 do .>5,cent,irnode,ifrpid) 
Camieenode( tp Go we medeird 
Z=ath*(1932.0*b+(7296.0*w2-7200.0%w1))/2197.0 


Gall séndw(chan,15/274-0,1) 

tp=t+12.0*h/13.0 | 

eal ee ees ccs et 

call fnode(tp,d5,w3,nodeid) 

z=ath* ( (8341 .0%b-845 .0%w3 )+(2944.0%*w2-32832.0%*w1))/4104.0 


call sendw(chan,15,z,4,0,1) 

tp=t+ 

Gal). een node trp2e) 

Gali fnode ET ae oee Ae 

Z=ath*((-608 pO a eeee 05-5643 07N4 ) ) 
al +(41040 .0%w1-28352.0%w2) )/20520.0 


call sendw(chan,15,z,4,0,1) 
Ep=eth/ 2.0 
eal recvalenan,c0,d5,6,cnt,tfrnode,frpid) 
gan tnodet #,.d5,ws,node1 
zz(1)=ath*( (302880 .0*b 
1 + 811719204). 
Zi mess ssoC4.0-02127 7020 .07w5 ))/7618050.0 
zz(2)=abs((-2090. 0*b+(21970.0%w3-15048.0%*w4) ) 
+(22528.0%w2-27360.0%*w5) ) 
eal -senew chan, 15,22,98,0,1) 


Gor co 10 

end 

subroutine fnode(tp,d,v,nodeid) 
dimension ) 

Gemee versa.) noderd 


1 Be=22 G1 =dil )*d(2)) 
return 

2 v=-(d(2)-d(1)*d(2)) 
return 


end 
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APPENDIX E 


PROGRAM LISTING EXCERPTS FOR THE TWO EQUATION FLIP- 
FLOP SCHEME 


The following listing is an excerpt from the node 0 program for the Flip-flop 


implementation of the two equation system. 


220 mete 


ch=h/4.0 
Cc 
buf£(3)=y(1) 
bus (4)= Te | 
call sendwttes, 10, bute, le, 
buf gay 2) 
buf(4)=yp(2) 
ail sendaqed, 10 sptiee Geert 
G 
ao 22) "k=tew 
221 £5(k)=y(k)+ch*yp(k) 
c 
call sendwtei, 205 45 tena) 
e 


do Za k=1992 

call recvw(ci,15,zz,8,cnt, frnode, frpid) 

if (frnode .eq.1) then 
f1(frnode)=2z2(1) 
eee(frnode)= 22(2) 

else 
frnode=frnode- 
£1(frnode)= Sein 
eee (frnode )=zz(2) 

end if 

231 continue 


The following listing is an excerpt from the node | program for the Flip-flop 


implementation of the two equation system. 


10 call Onan 20 ,¢ ee frnode,frpid) 
wl=2.0%(d et (2)) 
z=at3.0*h*(b+3. 5 we {320 


call sendw(chan,23,buf1,12,3,1) 
call recvw pean 32 ,bufl,12,cnt, frnode, frpid) 


pe tai la .0*h/8 

wW af): te ~ lez 
ae ae 1932.07b+ 
eal sendy (chan Ae 
call sendw(chan,43,d 
‘Dat t+12. a= 

W 5 Okla -d(1)*d(2)) 


ode. O*%w2=7200,07W1) 7 Zia 
ce ,4,ent, Eynode, fypica) 


Aa) 
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d(1)=ath*( (8341 .0%*b-845.0%w3)+( 29440 .0*w2-32832.0*w1))/4104.0 
call recvw(chan,42,d ae 7s Soe rnode,frpid) 
Call sendw(chan,43,d{ nay 


tp=t+h 

W4=2. -94(4(1)24(1)44(2)) 
a= ath* ((-6080.0*b+(9295 .0*w3-5643 .0*w4) ) 
1 +(41040 .0%w1-28352. 0*w2))/2052 207.0 

call Eaportenaa 42, ae ,4,cnt, frnode, frpid) 
call sendw(chan,43,d aol) 


a “O*(a01)-~ 4(1)*d(2)) 
2z(1)= ath*( (902880 
1 (2655755. ee AAG). O*w4) ) 
Z +(3953564.0%w2+277020. O*%wS) yeol soo 50.0 
Za) = Cres -2090.0%b+(21970.0%*w3-15048. Oxw4) ) 
1 22528 .0*w2-27360.0%w5)) 
call nee wieman,15 22,0, 0,1) 


goveco 10 


The following listing 1s an excerpt from the node 3 program for the Flip-flop 


implementation of the 2 equation system. 


10 call eC ee 20, Gsyent.frnode, tropic) 


to=tth/4.0 
Camierecu. chan,23,butl,12,cnt, frnode, frpid) 
wi=-(8(2)- (1)*d(2)) 
oo 3.0*h*(b+3.0%wl)/32.0 
d =bufl (3) 


Cale Seney Span S27 bul we 1.1) 


=t+3. 
oe - (ate ynaij ea 2) 
one Oe ese. 0 Buea nOewZ=7200.0-41))/2197.0 
Can sees Sean ,42 ie ie 

call recvw(chan,43,d 4,cnt,frnode, ic tao cl) 
: a7 73 on aT 3ae a 

1)* 

a2) eat ee O* Nae 845 .0*w3)+(29440.0*w2-32832.0%*w1))/4104.0 
cau ate 


chan,42, (2) .4. 1. ak 
call recvw(chan,42,d(1),4,cnt,frnode,frpid) 


tp=tth 
was—(d(2)- -4(1)*4(2)) 
d(2)=ath*( (-6080 .0*b+(9295.0*w3-5643.0%*w4)) 
1 +(41040 .0*w1-28352. o*w2) )/20520. 0 
call pene eran: ,42, Baie US ak 
call recvw(chan,43,d 4,cnt,frnode, frpid) 
tp=tth/2.0 
w5=-(d(2)- Ctadagao. 2 





22Z2(1)=ath*((902880.0*b 
al Pioooo1s5.0°Wo-1371249. eat) 
Z mee 5004 .0~w2+277020.0*%ws ) )/7618050.0 
22(2)=abs((-2090.0*b+(21970.0%w3-15048.0%*w4) ) 
il Esae ZOO. O*w2- Z7 5008 O*WS) ) 
call sen w(chan, ior ZZ 2600, | ) 


go to 10 


APPENDIX F 


PROGRAM LISTING EXCERPTS FOR THE THREE EQUATION TRAIN 
SCHEME 


The following listing is an excerpt from the node 0 program for the Train 


implementation of the three equation system. 


220 Bee 
(2)=h 
e 
ch=n/4.0 
do 222 k=1,3 
but ey 
but ( k+5 yp ) 
222 buf(k+8)=£5(k) 
C 
do 224 ijk=1,4 
Call send (ci, 10, buf,68, 1,1 
call sendvccer. 10 bur Ge 2 
C 


GO" 223. 1i= ie 
Gala recvw(ci, LO: ey bo ent, f£rnode, £rpaa) 
if (frnode eq. 2 then 
buf (10)=bué Crs) 


ls 
bus (9)5 =buf1(12) 
buf(11)=buf1l(14) 
end if 


223 continue 
224 continue 


Call Scheer 10 but 6c Ba 
call sendw(ci1,10, ‘buf, Oe2 >i 


do 225 i11=1,2 
call recvw(ci,10, orn Bee ee frnode,frpid) 
if (frnode . eq. 2 
£1(2)=bufl1 ( ay” 
eee(2)=buf1(16) 


7G ! )\=pueiee 
21(3 )=bufl( 14) 
cect ae 


end if 
225 GOnelnue 


The following listing is the component node for the Train implementation of the 


three equation system. 


nodp2=nodeid+2_ 
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nodp5=nodeid+5 
nodpll=nodeidtll 
nodp14=node1id+14 


TOeGamierecvw( chan, 10,ouL ben cntuamunode, frpid) 
tp=tth/4.0 
a=buf nodp2} 
b=buf(nodp5 
call fnode(tp,d5,wi,node id) 
buf (nodp11)= =at3. O*h* (b+3. O*w1)/32.0 


ead. ea saan 10,buf,len,4,ndest,1) 
to=t+3.0*h 

call a 10,buf,len,cnt, frnode, frpid) 

call fnode(tp,d5, w2,nodeid 

buf (nodpll1)= Eth (1932. Oxb+ (7296. O*w2-7200.0%w1))/2197.0 


ead). Scone 1p but lien,ndest,1) 
bo=tri2.04h/ 15 
Gabe cecuw (chan, HOPeotesteny ent, -rnode, frpid) 
Gali rnoede( tp, d5., w3, nodeid 

eu tae o aothe( (8341. O%b-845 .0%w3)+(2944.0*w2-32832.0%*w1)) 


coo POE VSI, 10,buf,len,ndest,1) 

Co= 

Zaria Bore ener PO; but ,lenyent,frnode,frpid) 

Caliernocde ED .do, w4,nodei 

Bab. at+h* ( (-6080 .0*b+(9295.0%*w3-5643.0%*w4) ) 
040 .0*w1-28352. 0*w2))/20520. 0 


Call semen(chan, Omemr, len, ndest 1) 
tp=tt+h/2.0 
call recvw(chan 10,buf,len,cnt,frnode,frpid) 
Gan Enodet to ds. w5 snode id) 
See rea =ath*( (902880. O*b 
55735 .0%w3-1371249.0*w4 sy) 
=(3853664 0%w2+277020.0%w5) )/7618050 
a. oe abs ((- -2090. 0%b+( 21970. O*w3- 15048. O*w4 ) ) 
2528 .0*w2-27360.0*w5) ) 
call eee ha PO-Dpueelen naest, 1) 


A 


Nr 


go to 10 


end 


subroutine pee d,v,nodeid) 
dimension d(3) 


Gomer! ,2Z,5), nodend 


1 v=- ee) 
return 


Z Vesta, 12) "a 3) 
retur 


3. v=d(2)-d(3) 


return 
end 
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